Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent get_<DATA_ASSET> permissions - S3_Datasets #1727

Merged
merged 16 commits into from
Dec 10, 2024
Merged

Conversation

dlpzx
Copy link
Contributor

@dlpzx dlpzx commented Dec 3, 2024

Feature or Bugfix

  • Feature/Bugfix

Detail

This PR is the first part of a series and only handles S3_Datasets and its child components Tables and Folders

Current implementation

Most API calls on a particular resource are restricted by GET_RESOURCE permissions. But for resources that are indexed in the Catalog, some metadata is considered public as it is useful for data consumers to discover and understand the data assets. Users will click on these resources from the Catalog view and call one of the following API calls:

  • getDataset
  • getDatasetStorageLocation
  • getDatasetTable
  • getDashboard
  • getRedshiftDataset
  • getRedshiftTable

From the above list, initially all are decorated with resource_permission checks except for getDataset and getDatasetTable.

Target implementation

  • Public information should be available for data consumers to explore, that means that we first need to remove the resource_permission checks from the list of APIs.
  • Not all information is public, to get AWS information and other restricted metadata we still need to verify GET_X permissions on the resource.
  • For restricted metadata, we should provide default messages that do not break the frontend

In addition in this PR some unused fields are removed and syncTables is simplified to return an integer with the count of synced tables

Relates

Testing

For each of the following I tested with a user with GET permissions and without GET permissions. FE views render and show information or unauthorized to view info placeholder

  • Dataset View, Dataset Edit Form and list Datasets
  • Dataset Data tab with list of tables and folders
  • Table view, Table Edit
  • Folder view and Folder Edit

Other checks

  • Complete share request
  • With requester check table and folder and view the details of the account...
  • Worksheets query with owned table
  • Worksheets query with shared table
  • Crawl dataset - correct S3 path
  • Sync tables - correct params

Security

Please answer the questions below briefly where applicable, or write N/A. Based on
OWASP 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)?
    • Is the input sanitized?
    • What precautions are you taking before deserializing the data you consume?
    • Is injection prevented by parametrizing queries?
    • Have you ensured no eval or similar functions are used?
  • Does this PR introduce any functionality or component that requires authorization?
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms?
    • Are you logging failed auth attempts?
  • Are you using or adding any cryptographic features?
    • Do you use a standard proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users?
    • Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dlpzx dlpzx changed the title Consistent get_<DATA_ASSET> permissions Consistent get_<DATA_ASSET> permissions - S3_Datasets Dec 3, 2024
@dlpzx dlpzx marked this pull request as ready for review December 4, 2024 08:14
@dlpzx dlpzx force-pushed the fix/data-asset-perms branch from 851f36a to 85f5fc4 Compare December 4, 2024 10:52
@dlpzx dlpzx requested a review from petrkalos December 4, 2024 13:54
Comment on lines +32 to +33
gql.Field(name='S3BucketName', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: A call out that these 2 fields (S3BucketName and GlueDatabaseName) are a part of index in opensearch (backend/dataall/modules/s3_datasets/indexers/dataset_indexer.py) - not sure really how "restricted" they are treated if that information can be found elsewhere

gql.Field(name='IAMDatasetAdminRoleArn', type=gql.String),
gql.Field(name='KmsAlias', type=gql.String),
gql.Field(name='bucketCreated', type=gql.Boolean),
gql.Field(name='glueDatabaseCreated', type=gql.Boolean),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: making note we can remove the following from data model if we wanted to clean up sometime in the future

  • bucketCreated
  • glueDatabaseCreated
  • iamAdminRoleCreated
  • lakeformationLocationCreated
  • bucketPolicyCreated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to add a migration script because backfilling it in 2.6.2 was going to be a pain; but totally agree on the cleanup

fields=[
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
gql.Field(name='GlueTableName', type=gql.String),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to call out above that GlueDatabaseName is a part of index in opensearch (backend/dataall/modules/s3_datasets/indexers/table_indexer.py) - not sure really how "restricted" they are treated if that information can be found elsewhere

++ to note I would think best if anything would be to restrict on open search and keep the other server side restricted logic the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be the solution, restricting it also in OpenSearch

@@ -233,7 +233,7 @@ const FolderView = () => {
const fetchItem = useCallback(async () => {
setLoading(true);
const response = await client.query(getDatasetStorageLocation(params.uri));
if (!response.errors && response.data.getDatasetStorageLocation !== null) {
if (response.data.getDatasetStorageLocation !== null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: update the error dispatch to dispatch all errors (1 or more)

@@ -69,7 +70,7 @@ export const DatasetFolders = (props) => {
const response = await client.query(
listDatasetStorageLocations(dataset.datasetUri, filter)
);
if (!response.errors) {
if (response.data.getDataset != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dispatch all errors

gql.Field(name='importedS3Bucket', type=gql.Boolean),
gql.Field(name='importedGlueDatabase', type=gql.Boolean),
gql.Field(name='importedKmsKey', type=gql.Boolean),
gql.Field(name='importedAdminRole', type=gql.Boolean),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we also want to include the following share expiration / config related information as part of restricted?

        gql.Field(name='autoApprovalEnabled', type=gql.Boolean),
        gql.Field(name='enableExpiration', type=gql.Boolean),
        gql.Field(name='expirySetting', type=gql.String),
        gql.Field(name='expiryMinDuration', type=gql.Integer),
        gql.Field(name='expiryMaxDuration', type=gql.Integer),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially yes, but this information is fetched when opening a share request. At the end the user needs it for the share request creation

@@ -161,7 +161,7 @@ const DatasetView = () => {
const fetchItem = useCallback(async () => {
setLoading(true);
const response = await client.query(getDataset(params.uri));
if (!response.errors && response.data.getDataset !== null) {
if (response.data.getDataset !== null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expose all errors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with Dashboards. I don't know if adding more error banners is going to give users more useful info. If the top level query fails, then the errors that we might find in the nested resolvers are very unpredictable. For example if get_Dataset fails because the dataset is not found, the restricted is going to fail with something as 'URI is null'; which might be misleading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it guaranteed the first error in the dispatch always the top level?

@@ -222,7 +222,7 @@ const TableView = () => {
const fetchItem = useCallback(async () => {
setLoading(true);
const response = await client.query(getDatasetTable(params.uri));
if (!response.errors && response.data.getDatasetTable !== null) {
if (response.data.getDatasetTable !== null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expose all errors

frontend/src/utils/helpers/emptyPrintUnauthorized.js Outdated Show resolved Hide resolved
Comment on lines +257 to +262
dataset = dataset_fixture
dataset.GlueDatabaseName = dataset_fixture.restricted.GlueDatabaseName
dataset.region = dataset_fixture.restricted.region
dataset.S3BucketName = dataset_fixture.restricted.S3BucketName
dataset.AwsAccountId = dataset_fixture.restricted.AwsAccountId
table1 = table(dataset=dataset, name='table1', username=user.username)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: any reason handled this way here and updated def table factory for a similar/same required change at tests/modules/s3_datasets/conftest.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not very elegant, agreed. The issue is that the dataset_fixture parameter takes in this case the value of the gql Dataset type. While the table factory accepts db S3Dataset model

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow - is it not the exact same dataset and table factories used in tests/modules/s3_datasets/conftest.py but in that conftest we define Table factor as

@pytest.fixture(scope='module', autouse=True)
def table(db):
    cache = {}

    def factory(dataset: S3Dataset, name, username) -> DatasetTable:
        key = f'{dataset.datasetUri}-{name}'
        if cache.get(key):
            return cache.get(key)
        with db.scoped_session() as session:
            table = DatasetTable(
                name=name,
                label=name,
                owner=username,
                datasetUri=dataset.datasetUri,
                GlueDatabaseName=dataset.restricted.GlueDatabaseName,
                GlueTableName=name,
                region=dataset.restricted.region,
                AWSAccountId=dataset.restricted.AwsAccountId,
                S3BucketName=dataset.restricted.S3BucketName,
                S3Prefix=f'{name}',
            )
...

(note the way we set GlueDatabaseName, region, S3BucketName, and S3BucketName)

@noah-paige
Copy link
Contributor

All minor enhancmenets / nits - the only pending comment I have before approval is regarding restrictions to additional dataset config info (re: #1727 (comment)) cc: @dlpzx

dlpzx added a commit that referenced this pull request Dec 6, 2024
### Feature or Bugfix
- Feature
- Bugfix

### Detail
This is a continuation of
#1727 but for Dashboards.

- removes GET_DASHBOARD permission check on get_dashboard
- adds a restriction field in the Dashboard type with the restricted
information
- implement resolver to return the restricted information that checks
GET_DASHBOARD permissions. Return defaults for users with no permissions
- Adapt frontend to use the restricted fields

In addition i made some fixes in the frontend, for example, the default
view I changed it to Overview because it improves the experience of a
data consumer that is exploring the dashboard metadata. I also removed
the forever circular progress in the viewer page if there are errors in
the loading of the reader url.

### Relates
- #1727 

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@noah-paige
Copy link
Contributor

@dlpzx - I think will have to merge latest from OS, resolve conflicts, and make changes to test_permissions for the changes in auth checks / nested resolvers

Then good to go

dlpzx added 2 commits December 9, 2024 08:52
# Conflicts:
#	backend/dataall/modules/s3_datasets_shares/api/resolvers.py
#	frontend/src/modules/Folders/components/FolderOverview.js
@dlpzx dlpzx requested a review from noah-paige December 9, 2024 08:22
dlpzx added a commit that referenced this pull request Dec 9, 2024
- Feature
- Bugfix

This is a continuation of
#1727 but for Dashboards.

- removes GET_DASHBOARD permission check on get_dashboard
- adds a restriction field in the Dashboard type with the restricted
information
- implement resolver to return the restricted information that checks
GET_DASHBOARD permissions. Return defaults for users with no permissions
- Adapt frontend to use the restricted fields

In addition i made some fixes in the frontend, for example, the default
view I changed it to Overview because it improves the experience of a
data consumer that is exploring the dashboard metadata. I also removed
the forever circular progress in the viewer page if there are errors in
the loading of the reader url.

- #1727

Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
@dlpzx dlpzx merged commit 3d97d9e into main Dec 10, 2024
9 checks passed
dlpzx added a commit that referenced this pull request Dec 10, 2024
- Feature/Bugfix

This PR is the first part of a series and only handles S3_Datasets and
its child components Tables and Folders

Most API calls on a particular resource are restricted by GET_RESOURCE
permissions. But for resources that are indexed in the Catalog, some
metadata is considered public as it is useful for data consumers to
discover and understand the data assets. Users will click on these
resources from the Catalog view and call one of the following API calls:
- getDataset
- getDatasetStorageLocation
- getDatasetTable
- getDashboard
- getRedshiftDataset
- getRedshiftTable

From the above list, initially all are decorated with
resource_permission checks except for getDataset and getDatasetTable.

- Public information should be available for data consumers to explore,
that means that we first need to remove the resource_permission checks
from the list of APIs.
- Not all information is public, to get AWS information and other
restricted metadata we still need to verify GET_X permissions on the
resource.
- For restricted metadata, we should provide default messages that do
not break the frontend

In addition in this PR some unused fields are removed and `syncTables`
is simplified to return an integer with the count of synced tables

For each of the following I tested with a user with GET permissions and
without GET permissions. FE views render and show information or
unauthorized to view info placeholder
- [X] Dataset View, Dataset Edit Form and list Datasets
- [x] Dataset Data tab with list of tables and folders
- [X] Table view, Table Edit
- [X] Folder view and Folder Edit

Other checks
- [x] Complete share request
- [X] With requester check table and folder and view the details of the
account...
- [X] Worksheets query with owned table
- [X] Worksheets query with shared table
- [X] Crawl dataset - correct S3 path
- [X] Sync tables - correct params

Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants